Module 1 - Lab 2
Word Embeddings & Neural Representations
Lab Overview
This lab transitions you from count-based representations (Bag of Words) to learned semantic representations (word embeddings), which form the foundation of modern neural NLP and generative AI systems.
- Bag-of-Words treats language as counts
- Embeddings treat language as geometry
- Neural networks learn representations, not rules
This lab explains how and why that shift happens.
Learning Objectives
After completing this lab, you should be able to:
- Explain the limitations of Bag-of-Words
- Describe how distributional semantics works
- Understand how neural networks learn word meaning
- Connect backpropagation to representation learning
- Articulate why embeddings matter for business analytics
From Bag-of-Words to Distributed Meaning
Bag-of-Words assumes:
- Words are independent
- Order does not matter
- Meaning ≈ frequency
This creates well-known failures:
- Synonyms look unrelated
- Negation is ignored
- Context is lost
This violates a core linguistic principle:
“You shall know a word by the company it keeps.” — J.R. Firth
Embeddings operationalize this idea mathematically.
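A minimal sketch of the synonym failure described above, using hypothetical one-row Bag-of-Words count vectors (the three-word vocabulary is invented for illustration):

```python
import numpy as np

# Toy Bag-of-Words vectors over the vocabulary ["profit", "earnings", "loss"].
# "profit" and "earnings" are near-synonyms, yet their count vectors share
# no dimensions, so BoW treats them as completely unrelated.
profit   = np.array([1.0, 0.0, 0.0])
earnings = np.array([0.0, 1.0, 0.0])

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

cosine(profit, earnings)  # 0.0: synonyms look maximally dissimilar
```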
What Is a Word Embedding?
A word embedding is a dense numerical vector where:
- Each dimension encodes latent semantic features
- Similar words lie close in vector space
- Relationships emerge geometrically
Instead of counting words:
- We predict context
- The prediction error updates vectors
- Vectors move closer/farther based on usage
Meaning is not labeled; it is learned implicitly.
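A toy illustration of the geometric view, with hand-picked 3-dimensional vectors (real embeddings are learned, not hand-crafted, and typically have 100+ dimensions; the words and coordinates here are invented):

```python
import numpy as np

# Hand-built vectors: related words point in similar directions.
emb = {
    "profit":   np.array([0.90, 0.10, 0.00]),
    "earnings": np.array([0.85, 0.15, 0.05]),
    "lawsuit":  np.array([0.00, 0.20, 0.95]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

cosine(emb["profit"], emb["earnings"])  # close to 1: similar meaning
cosine(emb["profit"], emb["lawsuit"])   # close to 0: unrelated
```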
1 Loading SEC 10-K Text Data
- Load raw text documents into memory for tokenization and modeling.
- SEC 10-K filings:
- Are long, unstructured
- Contain domain-specific language
- Are ideal for testing semantic models
The code below:
- Locates all SEC 10-K text files
- Verifies dataset availability
- Ensures reproducibility across Colab sessions
```python
from pathlib import Path

# Colab: adjust the path to your Google Drive location
# DATA_DIR = Path("/content/drive/MyDrive/SEC-10K-2024")
# files = sorted(DATA_DIR.glob("*.txt"))

# Local: only use this if the data is on your local machine
DATA_DIR = Path("../../data/SEC-10K-2024")
files = sorted(DATA_DIR.glob("*.txt"))
len(files)
```
2 Minimal Tokenization
- Tokenization converts raw text into atomic units (tokens) that models can process.
- We intentionally keep this simple to:
- Avoid hiding complexity
- Emphasize representation learning
- Focus on meaning, not preprocessing tricks
- Lowercasing ensures consistency
- Removing punctuation reduces noise
- Filtering short tokens removes artifacts
```python
import re

def tokenize(text):
    text = text.lower()                     # lowercase for consistency
    text = re.sub(r"[^a-z\s]", " ", text)   # strip punctuation and digits
    return [t for t in text.split() if len(t) > 2]  # drop short artifacts
```
- Creates clean word sequences
- Preserves semantic structure
- Prepares data for neural learning
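The training call in Section 3 consumes a variable named `tokenized_docs` that the lab never constructs. A minimal sketch of building it, assuming each filing fits in memory (the `tokenize` helper and file listing are repeated here so the cell stands alone):

```python
import re
from pathlib import Path

def tokenize(text):
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    return [t for t in text.split() if len(t) > 2]

# Reuses the DATA_DIR layout from Section 1; adjust the path as needed.
files = sorted(Path("../../data/SEC-10K-2024").glob("*.txt"))
tokenized_docs = [tokenize(p.read_text(errors="ignore")) for p in files]
```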
3 Training Word Embeddings (Word2Vec)
Word2Vec trains a shallow neural network that learns word vectors by:
- Predicting nearby words (Skip-gram), or
- Predicting a word from its context (CBOW)
One-hot word → Embedding layer → Context prediction → Loss
- The embedding layer is the model.
- Learns 100-dimensional word vectors
- Uses local context (window=5)
- Ignores rare/noisy terms
- Trains via stochastic gradient descent
```python
from gensim.models import Word2Vec

model = Word2Vec(
    sentences=tokenized_docs,  # list of token lists from Section 2
    vector_size=100,           # embedding dimensionality
    window=5,                  # local context size
    min_count=5,               # ignore rare/noisy terms
    workers=2,                 # parallel training threads
)
```
4 Inspecting Learned Meaning
Once trained:
- Distance = similarity
- Direction = relationship
- Arithmetic ≈ semantics

No rules were written. No labels were provided. Yet meaning emerges.
```python
model.wv.most_similar("risk", topn=5)
```
The nearest neighbors reveal:
- Which concepts the model associates
- How business language clusters
- Whether learning aligns with intuition
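"Arithmetic ≈ semantics" can be made concrete with hand-built toy vectors. Real models learn such regularities from data; the words and 2-d coordinates below are invented purely to show the mechanics of an analogy query (gain − increase + decrease ≈ loss):

```python
import numpy as np

# Hand-built 2-d vectors encoding a direction for "negation".
vec = {
    "increase": np.array([ 1.0, 1.0]),
    "decrease": np.array([-1.0, 1.0]),
    "gain":     np.array([ 1.0, 2.0]),
    "loss":     np.array([-1.0, 2.0]),
}

query = vec["gain"] - vec["increase"] + vec["decrease"]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Nearest word to the query vector (excluding the query word itself)
nearest = max((w for w in vec if w != "gain"), key=lambda w: cosine(query, vec[w]))
nearest  # "loss"
```

This is the same computation `model.wv.most_similar(positive=[...], negative=[...])` performs over the full learned vocabulary.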
5 How Does the Neural Network Learn Meaning?
5.1 Key Idea
The network is not told what words mean.
It:
- Makes a prediction
- Measures error
- Updates vectors
Meaning is the byproduct of optimization.
6 Backpropagation — Conceptual Explanation
6.1 What Backpropagation Does
Backpropagation computes:
How much should each parameter change to reduce error?
6.2 In Embeddings
- Parameters = word vectors
- Loss = incorrect context prediction
- Gradient = direction to move vectors
Each training step nudges vectors into better semantic positions.
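The three bullets above can be traced in a few lines of NumPy: one gradient step of a skip-gram-style logistic update on a single true (center, context) pair. The vector values and learning rate are arbitrary illustrations, not the exact Word2Vec internals:

```python
import numpy as np

rng = np.random.default_rng(0)
center  = rng.normal(scale=0.1, size=4)   # hypothetical word vector
context = rng.normal(scale=0.1, size=4)   # hypothetical context vector
lr = 0.5                                  # learning rate

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

before = sigmoid(center @ context)        # predicted P(pair is real)
grad = sigmoid(center @ context) - 1.0    # error signal for a true pair (label = 1)

# Gradient step: each vector moves toward the other (both use old values)
center, context = center - lr * grad * context, context - lr * grad * center

after = sigmoid(center @ context)
assert after > before  # the true pair now scores higher
```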
7 Why This Matters for Generative AI
7.1 Conceptual Continuity
Large Language Models:
- Still use embeddings
- Still use backpropagation
- Still optimize prediction error
What changes is:
- Scale
- Architecture depth
- Training data size
The core idea remains identical.
8 Business Relevance
| Representation | Business Value |
|---|---|
| Bag-of-Words | Audits, baselines |
| Embeddings | Similarity, clustering |
| Neural models | Prediction, generation |
8.1 SEC 10-K Applications
- Risk similarity detection
- Peer benchmarking
- Topic drift analysis
- Early warning signals
9 Deliverables
Answer the following (conceptual, not code-heavy):
- Why do embeddings outperform Bag-of-Words for financial text?
- How does backpropagation enable semantic learning?
- Name one business analytics task improved by embeddings.
10 Conceptual Takeaway
Bag-of-Words counts language. Embeddings learn language.
This lab completes your transition from classical text analytics to neural NLP foundations.